{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "83GJJF9fAgyP"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_08_5_kaggle_project.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HL640ydsAgyQ"
   },
   "source": [
    "# T81-558: Applications of Deep Neural Networks\n",
    "**Module 8: Kaggle Data Sets**\n",
    "* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)\n",
    "* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "a4ih9V7vAgyR"
   },
   "source": [
    "# Module 8 Material\n",
    "\n",
    "* Part 8.1: Introduction to Kaggle [[Video]](https://www.youtube.com/watch?v=7Mk46fb0Ayg&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_1_kaggle_intro.ipynb)\n",
    "* Part 8.2: Building Ensembles with Scikit-Learn and PyTorch [[Video]](https://www.youtube.com/watch?v=przbLRCRL24&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_2_pytorch_ensembles.ipynb)\n",
    "* Part 8.3: How Should you Architect Your PyTorch Neural Network: Hyperparameters [[Video]](https://www.youtube.com/watch?v=YTL2BR4U2Ng&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_3_pytorch_hyperparameters.ipynb)\n",
    "* Part 8.4: Bayesian Hyperparameter Optimization for PyTorch [[Video]](https://www.youtube.com/watch?v=1f4psgAcefU&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_4_bayesian_hyperparameter_opt.ipynb)\n",
    "* **Part 8.5: Current Semester's Kaggle** [[Video]] [[Notebook]](t81_558_class_08_5_kaggle_project.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uU7OTe1DAgyR"
   },
   "source": [
    "# Google CoLab Instructions\n",
    "\n",
    "The following code ensures that Google CoLab is running the correct version of TensorFlow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "NOdFRzaXAgyS",
    "outputId": "2475bc8b-19b2-487a-916a-3667060e76cf"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Note: using Google CoLab\n",
      "Using device: mps\n"
     ]
    }
   ],
   "source": [
    "# Start CoLab\n",
    "try:\n",
    "    COLAB = True\n",
    "    print(\"Note: using Google CoLab\")\n",
    "except:\n",
    "    print(\"Note: not using Google CoLab\")\n",
    "    COLAB = False\n",
    "\n",
    "# Make use of a GPU or MPS (Apple) if one is available.  (see module 3.2)\n",
    "import torch\n",
    "device = (\n",
    "    \"mps\"\n",
    "    if getattr(torch, \"has_mps\", False)\n",
    "    else \"cuda\"\n",
    "    if torch.cuda.is_available()\n",
    "    else \"cpu\"\n",
    ")\n",
    "print(f\"Using device: {device}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LFMTMsOWAgyS"
   },
   "source": [
    "# Part 8.5: Current Semester's Kaggle\n",
    "\n",
    "Kaggke competition site for current semester:\n",
    "* [Fall 2023 Kaggle Assignment](https://www.kaggle.com/competitions/applications-of-deep-learning-wustl-fall-2023/overview)\n",
    "\n",
    "Previous Kaggle competition sites for this class (NOT this semester's assignment, feel free to use code):\n",
    "* [Spring 2023 Kaggle Assignment](https://www.kaggle.com/competitions/applications-of-deep-learning-wustlspring-2023)\n",
    "* [Fall 2022 Kaggle Assignment](https://www.kaggle.com/competitions/applications-of-deep-learning-wustlfall-2022)\n",
    "* [Spring 2022 Kaggle Assignment](https://www.kaggle.com/c/tsp-cv)\n",
    "* [Fall 2021 Kaggle Assignment](https://www.kaggle.com/c/applications-of-deep-learning-wustlfall-2021)\n",
    "* [Spring 2021 Kaggle Assignment](https://www.kaggle.com/c/applications-of-deep-learning-wustl-spring-2021b)\n",
    "* [Fall 2020 Kaggle Assignment](https://www.kaggle.com/c/applications-of-deep-learning-wustl-fall-2020)\n",
    "* [Spring 2020 Kaggle Assignment](https://www.kaggle.com/c/applications-of-deep-learningwustl-spring-2020)\n",
    "* [Fall 2019 Kaggle Assignment](https://kaggle.com/c/applications-of-deep-learningwustl-fall-2019)\n",
    "* [Spring 2019 Kaggle Assignment](https://www.kaggle.com/c/applications-of-deep-learningwustl-spring-2019)\n",
    "* [Fall 2018 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2018)\n",
    "* [Spring 2018 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-spring-2018)\n",
    "* [Fall 2017 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2017)\n",
    "* [Spring 2017 Kaggle Assignment](https://inclass.kaggle.com/c/applications-of-deep-learning-wustl-spring-2017)\n",
    "* [Fall 2016 Kaggle Assignment](https://inclass.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2016)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "p4eUCyQaAgyT"
   },
   "source": [
    "## Iris as a Kaggle Competition\n",
    "\n",
    "If I used the Iris data as a Kaggle, I would give you the following three files:\n",
    "\n",
    "* [kaggle_iris_test.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_test.csv) - The data that Kaggle will evaluate you on. It contains only input; you must provide answers.  (contains x)\n",
    "* [kaggle_iris_train.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_train.csv) - The data that you will use to train. (contains x and y)\n",
    "* [kaggle_iris_sample.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_sample.csv) - A sample submission for Kaggle. (contains x and y)\n",
    "\n",
    "Important features of the Kaggle iris files (that differ from how we've previously seen files):\n",
    "\n",
    "* The iris species is already index encoded.\n",
    "* Your training data is in a separate file.\n",
    "* You will load the test data to generate a submission file.\n",
    "\n",
    "The following program generates a submission file for \"Iris Kaggle\". You can use it as a starting point for assignment 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "FoBv4ji_AgyT",
    "outputId": "c9fc1ce4-aab9-4190-e539-31b5a1559a16"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of classes: 3\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "from torch.utils.data import DataLoader, TensorDataset\n",
    "from sklearn import metrics\n",
    "import numpy as np\n",
    "\n",
    "# Read the data\n",
    "df_train = pd.read_csv(\n",
    "    \"https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_train.csv\", na_values=['NA', '?'])\n",
    "\n",
    "# Encode feature vector\n",
    "df_train.drop('id', axis=1, inplace=True)\n",
    "\n",
    "num_classes = len(df_train.groupby('species').species.nunique())\n",
    "print(\"Number of classes: {}\".format(num_classes))\n",
    "\n",
    "# Convert to numpy - Classification\n",
    "x = df_train[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values\n",
    "dummies = pd.get_dummies(df_train['species'])  # Classification\n",
    "species = dummies.columns\n",
    "y = dummies.values\n",
    "\n",
    "# Split into train/test\n",
    "x_train, x_test, y_train, y_test = train_test_split(\n",
    "    x, y, test_size=0.25, random_state=45)\n",
    "\n",
    "# Convert to PyTorch tensors\n",
    "x_train, y_train = torch.tensor(x_train, dtype=torch.float32), torch.tensor(y_train, dtype=torch.float32)\n",
    "x_test, y_test = torch.tensor(x_test, dtype=torch.float32), torch.tensor(y_test, dtype=torch.float32)\n",
    "\n",
    "# Define the model using torch.nn.Sequential\n",
    "model = nn.Sequential(\n",
    "    nn.Linear(x.shape[1], 50),\n",
    "    nn.ReLU(),\n",
    "    nn.Linear(50, 25),\n",
    "    nn.Linear(25, y.shape[1]),\n",
    "    nn.Softmax(dim=1)\n",
    ")\n",
    "\n",
    "optimizer = torch.optim.Adam(model.parameters())\n",
    "loss_fn = nn.CrossEntropyLoss()\n",
    "\n",
    "# Training loop with early stopping\n",
    "n_epochs = 1000\n",
    "patience = 5\n",
    "best_loss = float('inf')\n",
    "early_stopping_counter = 0\n",
    "\n",
    "for epoch in range(n_epochs):\n",
    "    # Train\n",
    "    model.train()\n",
    "    optimizer.zero_grad()\n",
    "    y_pred = model(x_train)\n",
    "    loss = loss_fn(y_pred, torch.argmax(y_train, 1))\n",
    "    loss.backward()\n",
    "    optimizer.step()\n",
    "\n",
    "    # Validate\n",
    "    model.eval()\n",
    "    with torch.no_grad():\n",
    "        y_val_pred = model(x_test)\n",
    "        val_loss = loss_fn(y_val_pred, torch.argmax(y_test, 1))\n",
    "    \n",
    "    if val_loss < best_loss:\n",
    "        best_loss = val_loss\n",
    "        early_stopping_counter = 0\n",
    "    else:\n",
    "        early_stopping_counter += 1\n",
    "        if early_stopping_counter >= patience:\n",
    "            print(\"Early Stopping!\")\n",
    "            break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "u5A6iWVhAgyU"
   },
   "source": [
    "Now that we've trained the neural network, we can check its log loss."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "dX2DIswHAgyU",
    "outputId": "79b55679-114e-4ff9-8b72-00ddb8a65746"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Log loss score: 0.015514996025663396\n"
     ]
    }
   ],
   "source": [
    "# Calculate multi log loss error\n",
    "model.eval()\n",
    "with torch.no_grad():\n",
    "    y_pred = model(x_test)\n",
    "    y_pred = y_pred.numpy()\n",
    "score = metrics.log_loss(y_test, y_pred)\n",
    "print(\"Log loss score: {}\".format(score))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Hmf6QKjdAgyV"
   },
   "source": [
    "Now we are ready to generate the Kaggle submission file.  We will use the iris test data that does not contain a $y$ target value.  It is our job to predict this value and submit it to Kaggle."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Fc5roTyDAgyV",
    "outputId": "c1fcbc80-4d56-4ff5-a353-dd45d3bb760d"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    id     species-0  species-1     species-2\n",
      "0  100  5.431684e-05   0.999945  3.297705e-07\n",
      "1  101  6.042830e-09   0.010619  9.893807e-01\n",
      "2  102  6.944081e-10   0.000963  9.990373e-01\n",
      "3  103  9.997644e-01   0.000236  2.038801e-36\n",
      "4  104  9.998689e-01   0.000131  3.686617e-37\n"
     ]
    }
   ],
   "source": [
    "# Generate Kaggle submit file\n",
    "df_test = pd.read_csv(\n",
    "    \"https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_test.csv\", na_values=['NA', '?'])\n",
    "\n",
    "# Convert to numpy - Classification\n",
    "ids = df_test['id']\n",
    "df_test.drop('id', axis=1, inplace=True)\n",
    "x_kaggle = df_test[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values\n",
    "x_kaggle = torch.tensor(x_kaggle, dtype=torch.float32)\n",
    "\n",
    "# Generate predictions\n",
    "model.eval()\n",
    "with torch.no_grad():\n",
    "    pred_kaggle = model(x_kaggle)\n",
    "pred_kaggle = pred_kaggle.numpy()\n",
    "\n",
    "# Create submission data set\n",
    "df_submit = pd.DataFrame(pred_kaggle)\n",
    "df_submit.insert(0, 'id', ids)\n",
    "df_submit.columns = ['id', 'species-0', 'species-1', 'species-2']\n",
    "\n",
    "# Write submit file locally\n",
    "df_submit.to_csv(\"iris_submit.csv\", index=False)\n",
    "\n",
    "print(df_submit.head())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Mw5ZEszvAgyV"
   },
   "source": [
    "## MPG as a Kaggle Competition (Regression)\n",
    "\n",
    "If the Auto MPG data were used as a Kaggle, you would be given the following three files:\n",
    "\n",
    "* [kaggle_mpg_test.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_test.csv) - The data that Kaggle will evaluate you on.  Contains only input, you must provide answers.  (contains x)\n",
    "* [kaggle_mpg_train.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_test.csv) - The data that you will use to train. (contains x and y)\n",
    "* [kaggle_mpg_sample.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_sample.csv) - A sample submission for Kaggle. (contains x and y)\n",
    "\n",
    "Important features of the Kaggle iris files (that differ from how we've previously seen files):\n",
    "\n",
    "The following program generates a submission file for \"MPG Kaggle\".  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "JjZ1Q_HpAgyV",
    "outputId": "00ed3905-be90-4a2e-9834-6cd57ac042c2"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Early stopping\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.optim as optim\n",
    "from sklearn.model_selection import train_test_split\n",
    "import pandas as pd\n",
    "import io\n",
    "import os\n",
    "import requests\n",
    "import numpy as np\n",
    "from sklearn import metrics\n",
    "\n",
    "# Download and preprocess data\n",
    "save_path = \".\"\n",
    "df = pd.read_csv(\"https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_train.csv\", na_values=['NA', '?'])\n",
    "cars = df['name']\n",
    "df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())\n",
    "\n",
    "x = df[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin']].values\n",
    "y = df['mpg'].values\n",
    "\n",
    "x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)\n",
    "\n",
    "# Convert numpy arrays to PyTorch tensors\n",
    "x_train, x_test, y_train, y_test = map(torch.tensor, (x_train, x_test, y_train, y_test))\n",
    "x_train, x_test = x_train.float(), x_test.float()\n",
    "y_train, y_test = y_train.float().unsqueeze(1), y_test.float().unsqueeze(1)\n",
    "\n",
    "# Define the neural network using Sequential\n",
    "model = nn.Sequential(\n",
    "    nn.Linear(x_train.shape[1], 25),\n",
    "    nn.ReLU(),\n",
    "    nn.Linear(25, 10),\n",
    "    nn.ReLU(),\n",
    "    nn.Linear(10, 1)\n",
    ")\n",
    "\n",
    "# Define loss and optimizer\n",
    "criterion = nn.MSELoss()\n",
    "optimizer = optim.Adam(model.parameters())\n",
    "\n",
    "# Early stopping criteria\n",
    "min_delta = 1e-3\n",
    "patience = 5\n",
    "best_loss = float('inf')\n",
    "count = 0\n",
    "\n",
    "# Training loop\n",
    "for epoch in range(1000):\n",
    "    model.train()\n",
    "    optimizer.zero_grad()\n",
    "    outputs = model(x_train)\n",
    "    loss = criterion(outputs, y_train)\n",
    "    loss.backward()\n",
    "    optimizer.step()\n",
    "\n",
    "    with torch.no_grad():\n",
    "        model.eval()\n",
    "        val_outputs = model(x_test)\n",
    "        val_loss = criterion(val_outputs, y_test)\n",
    "        if val_loss < best_loss - min_delta:\n",
    "            best_loss = val_loss\n",
    "            count = 0\n",
    "        else:\n",
    "            count += 1\n",
    "        if count > patience:\n",
    "            print(\"Early stopping\")\n",
    "            break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BFJcZDy6AgyV"
   },
   "source": [
    "Now that we've trained the neural network, we can check its RMSE error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "-8zshQm0AgyV",
    "outputId": "b5e8d691-798b-445e-f44b-a997fad1ab6b"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Final score (RMSE): 13.760814666748047\n"
     ]
    }
   ],
   "source": [
    "# Predict\n",
    "model.eval()\n",
    "with torch.no_grad():\n",
    "    pred = model(x_test)\n",
    "\n",
    "# Measure RMSE\n",
    "score = torch.sqrt(criterion(pred, y_test))\n",
    "print(\"Final score (RMSE):\", score.item())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZQf79HgwAgyW"
   },
   "source": [
    "Now we are ready to generate the Kaggle submission file.  We will use the MPG test data that does not contain a $y$ target value.  It is our job to predict this value and submit it to Kaggle."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Y16gAEzkAgyW",
    "outputId": "fa7a3a20-f462-48b0-f154-a4b7eeaa66f3"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Final score (RMSE): 13.760814666748047\n",
      "    id        mpg\n",
      "0  350   9.085001\n",
      "1  351  10.218105\n",
      "2  352   9.354208\n",
      "3  353  11.105295\n",
      "4  354  10.152960\n"
     ]
    }
   ],
   "source": [
    "# Measure RMSE\n",
    "score = torch.sqrt(criterion(pred, y_test))\n",
    "print(\"Final score (RMSE):\", score.item())\n",
    "\n",
    "# Predict on the Kaggle test set\n",
    "df_test = pd.read_csv(\"https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_test.csv\", na_values=['NA', '?'])\n",
    "ids = df_test['id']\n",
    "df_test.drop('id', axis=1, inplace=True)\n",
    "df_test['horsepower'] = df_test['horsepower'].fillna(df['horsepower'].median())\n",
    "x = torch.tensor(df_test[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin']].values).float()\n",
    "\n",
    "with torch.no_grad():\n",
    "    predictions = model(x)\n",
    "\n",
    "# Prepare submission\n",
    "df_submit = pd.DataFrame(predictions.numpy(), columns=['mpg'])\n",
    "df_submit.insert(0, 'id', ids)\n",
    "df_submit.to_csv(\"auto_submit.csv\", index=False)\n",
    "print(df_submit.head())"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "colab": {
   "collapsed_sections": [],
   "name": "Copy of t81_558_class_08_5_kaggle_project.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3.9 (torch)",
   "language": "python",
   "name": "pytorch"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
